This file records the processing of OTU tables with the LULU algorithm for the manuscript "Reliable biodiversity metrics from co-occurence based post-clustering curation of amplicon data". Twenty unprocessed OTU tables and their corresponding files with representative sequences (centroids) have been constructed with different algorithms (VSEARCH, SWARM, DADA2 and DADA2+VSEARCH) and are now ready to be curated by the LULU algorithm.

This step should be carried out after the taxonomic filtering of primary OTU tables documented in the file: D_Taxonomic_filtering.Rmd
NB: All markdown chunks are set to "eval=FALSE". Change these accordingly. Code blocks that must be run outside R have been commented out with #. Change this accordingly.

Running the curation with LULU for all initial OTU tables produced with VSEARCH, SWARM, DADA2 and CROP

Bioinformatic tools necessary

Make sure that you have the following bioinformatic tools in your PATH:
VSEARCH v.2.02 or later
blastn v2.4.0+ or later
LULU

Provided scripts

A number of scripts are provided with this manuscript. Place these in your /bin directory and make them executable with "chmod 755 SCRIPTNAME", or place the scripts in the directory/directories where they should be executed (i.e. the analyses directory).

Analysis files

This step depends on the presence of the OTU tables from the initial rounds.

Setting directories and libraries etc

main_path <- "~/analyses" 
path <- file.path(main_path, "otutables_processing")

Producing match lists

Match lists can be produced with different algorithms that match all centroids against each other. Here we use blastn, as it gives better matching of erroneous sequences caused by deletions/insertions and/or sequences of unequal length. This task will take some time, as several of the datasets contain a high number of centroids. Produce match lists for all centroid files (XXX.plantcentroids).
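As a rough sketch of this step, the R helper below only assembles the two BLAST+ command lines for each centroid file and prints them, so that they can be reviewed and then run in a shell. The helper name build_matchlist_cmds, the example file names and the thresholds (80% query coverage, 84% identity, one thread) are assumptions made for illustration, not settings recorded in this file.

```r
# Hypothetical helper: assemble (but do not run) the BLAST+ commands that
# would produce a match list for one centroid file.
build_matchlist_cmds <- function(centroids,
                                 min_cov = 80,   # assumed query coverage (%)
                                 min_id = 84,    # assumed sequence identity (%)
                                 threads = 1) {  # assumed number of threads
  f <- shQuote(centroids)
  out <- shQuote(paste0(centroids, ".matchlist"))
  c(
    # 1) index the centroids as a nucleotide BLAST database
    sprintf("makeblastdb -in %s -parse_seqids -dbtype nucl", f),
    # 2) match all centroids against each other and keep three columns:
    #    query id, subject id, percent identity
    sprintf(paste("blastn -db %s -query %s -num_threads %d",
                  "-outfmt '6 qseqid sseqid pident'",
                  "-qcov_hsp_perc %d -perc_identity %d -out %s"),
            f, f, threads, min_cov, min_id, out)
  )
}

# Placeholder file names following the XXX.plantcentroids pattern
centroid_files <- c("VSEARCH_97.plantcentroids", "SWARM_3.plantcentroids")
cmds <- unlist(lapply(centroid_files, build_matchlist_cmds))
cat(cmds, sep = "\n")
```

Printing the commands instead of executing them keeps the sketch independent of a local BLAST+ installation; the output file name is the centroid file name with ".matchlist" appended.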

Now we have three matching files for each dataset: an OTU table (XXX.planttable), centroids (XXX.plantcentroids) and a match list (XXX.plantcentroids.matchlist). We can now run the LULU algorithm on each OTU table from each analysis, using the file pairs XXX.planttable and XXX.plantcentroids.matchlist.

Running the LULU algorithm on the datasets

Now we are ready to run the LULU algorithm on the datasets. The commands below use the LULU algorithm as a function defined in the separate R source file, LULU.r.
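To illustrate the shape of the two inputs, here is a minimal sketch with toy data. The OTU ids, counts and identities are invented, and the call to lulu() is guarded so that it only runs when a function of that name has already been defined (for example by sourcing LULU.r); otherwise the sketch only checks that the inputs are consistent.

```r
# Toy OTU table: OTUs as rows (ids as row names), samples as columns.
otutab <- data.frame(
  S001 = c(120, 10, 0),
  S002 = c(300, 25, 4),
  row.names = c("otu1", "otu2", "otu3")
)

# Toy match list: query OTU, matching OTU, percent sequence identity.
matchlist <- data.frame(
  query = c("otu2", "otu3"),
  hit = c("otu1", "otu1"),
  pident = c(98.5, 86.0),
  stringsAsFactors = FALSE
)

# Every id in the match list should also be a row of the OTU table.
ids_ok <- all(c(matchlist$query, matchlist$hit) %in% rownames(otutab))

# Run the curation only if lulu() is available in this session.
if (exists("lulu", mode = "function") && ids_ok) {
  curated <- lulu(otutab, matchlist)$curated_table
  print(dim(curated))
}
```

Checking the ids up front is a cheap way to catch a match list that was produced from a different centroid file than the table being curated.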
Process the datasets with LULU

allFiles <- list.files(path)
allTabs <- allFiles[grepl("planttable$", allFiles)]
allMls <- allFiles[grepl("matchlist$", allFiles)]
# Strip the file suffixes literally (fixed = TRUE), so that "." is not treated
# as a regular-expression wildcard
tab_names <- sort(sub(".planttable", "", allTabs, fixed = TRUE))
ml_names <- sort(sub(".plantcentroids.matchlist", "", allMls, fixed = TRUE))

if (!identical(tab_names, ml_names)) {
  stop("OTU tables and match lists do not match", call. = FALSE)
}

read_tabs <- file.path(path, allTabs)
read_mls <- file.path(path, allMls)
proc_files <- file.path(path, paste0("LULU-",tab_names))
proc_tabs <- file.path(path, paste0(allTabs,"_luluprocessed"))
# Vector for filtering... at this step redundant, but included for safety.
# NB: the sample list is cut off after "S067"; add the remaining sample names.
samples <- c("S001","S002","S003","S004","S005","S006","S007","S008","S067")

tab <- list()
ml <- list()

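for (i in seq_along(read_tabs)) {
  # Read the OTU table: OTU ids as row names, samples as columns
  tab[[i]] <- read.csv(read_tabs[i], sep = "\t", header = TRUE, row.names = 1)
  # Drop empty OTUs and keep only the listed samples
  tab[[i]] <- tab[[i]][which(rowSums(tab[[i]]) > 0), samples]
  # Read the match list (no header); keep OTU ids as character strings
  ml[[i]] <- read.csv(read_mls[i], sep = "\t", header = FALSE,
                      stringsAsFactors = FALSE)
  proc_min <- lulu(tab[[i]], ml[[i]]) ## RUNNING LULU for each table
  curated_table <- proc_min$curated_table ## extracting the curated table (line updated 17/01/2017)
  write.table(curated_table, proc_tabs[i], sep = "\t", quote = FALSE,
              col.names = NA)
}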

Now we have a processed table ("XXX.planttable_luluprocessed") for each original table ("XXX.planttable"), and these can be compared.
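One simple comparison is the number of OTUs before and after curation. The sketch below is self-contained: it writes two toy tables to a temporary directory and counts their rows. The helper count_otus, the file name demo.planttable and the counts are placeholders; with real data, the temporary directory would be replaced by the otutables_processing directory.

```r
# Hypothetical helper: number of OTUs (rows) in a tab-separated OTU table.
count_otus <- function(file) {
  nrow(read.csv(file, sep = "\t", header = TRUE, row.names = 1))
}

# Toy tables written to a temporary directory.
demo_dir <- tempfile("lulu_demo")
dir.create(demo_dir)
raw <- data.frame(S001 = c(120, 10, 3), S002 = c(300, 25, 0),
                  row.names = c("otu1", "otu2", "otu3"))
cur <- raw["otu1", , drop = FALSE]  # toy stand-in for a curated table
write.table(raw, file.path(demo_dir, "demo.planttable"),
            sep = "\t", quote = FALSE, col.names = NA)
write.table(cur, file.path(demo_dir, "demo.planttable_luluprocessed"),
            sep = "\t", quote = FALSE, col.names = NA)

# Pair each original table with its processed counterpart and count OTUs.
raw_files <- list.files(demo_dir, pattern = "planttable$", full.names = TRUE)
summary_tab <- data.frame(
  table = basename(raw_files),
  otus_before = vapply(raw_files, count_otus, integer(1), USE.NAMES = FALSE),
  otus_after = vapply(paste0(raw_files, "_luluprocessed"), count_otus,
                      integer(1), USE.NAMES = FALSE)
)
print(summary_tab)
```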

